Table of Contents

Introduction

1. Importing Libraries +Bring in Data

1.1 Importing Libraries

1.2 Loading Data

1.3 Description of the Time-Series Dataset

1.4 Description of the Qualitative Dataset

1.5 Missing values

1.6 Merging Datasets

1.7 Datetime features

1.8 Outliers Detection and Treatment

1.9 Stationary Test

1.10 Autocorrelation Plots

1.11 Encoding Categorical Variables

1.12 Splitting Data

1.13 Creating Lag and Rolling features

2.0 Model Selection

2.1 XGBOOST Model

2.2 Forecasting Target Variables

2.21 Target Variable- Total Inside Sales

2.22 Target Variable- Total Food Sales

2.23 Target Variable- Diesel Sales

2.24 Target Variable- Unleaded Sales

2.25 Cross Validation

3.0 The Other Models

3.1 Prophet Model

3.1 Target Variable- Total Inside Sales

3.2 Target Variable- Total Food Sales

3.3 Target Variable- Diesel Sales

3.4 Target Variable- Unleaded Sales

4.0 Linear Regression

5.0 Random Forest

6.0 SARIMA

7.0 Summary

8.0 Conclusion

9.0 Contributions

9.1 Daryle Bilog

9.2 Joe Sarnello

9.3 Sanskriti Bhargava

9.4 Vinay Kumar Vascuri

Introduction

In the fast-paced world of retail, businesses often experience the challenge of balancing high growth ambitions with the need for accurate forecasts to make well-informed financial decisions. Maverik, a thriving retail brand, is no exception. With plans to open or build 30 new stores each year, Maverik's high-growth trajectory requires precise first-year sales forecasts. These forecasts serve as the cornerstone for building effective financial plans, optimizing resource allocation, and providing a benchmark to assess store performance against projected outcomes. The success of this endeavor hinges on two primary factors: achieving a Return on Investment (ROI) that closely matches forecasted ROI and consistently attaining accurate sales metric forecasts as new sales data becomes available.

To address this critical business challenge, Maverik has embarked on a data-driven journey, combining the power of qualitative and time series data to develop a sophisticated time series forecasting model in Python. This model leverages qualitative insights from recent new store openings, network-wide seasonality patterns, and a diverse set of sales metrics. It will employ a variety of machine learning algorithms to generate daily-level forecasts, enabling Maverik to make precise financial and operational decisions.

The scope of this project encompasses the end-to-end development of the time series forecasting model. It will incorporate qualitative data, network-wide seasonality patterns, and other relevant factors to deliver accurate forecasts for the first year of sales for new stores.

This series of modeling tasks will explore the utilization of machine learning techniques to create a forecasting model tailored to Maverik's unique needs. It aims to provide Maverik with the insights and tools necessary to maintain its impressive growth while making data-driven decisions that foster long-term success.

1. Importing Libraries + Bring in Data¶

1.1 Importing Libraries¶

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import plotly.subplots as sp
from sklearn.preprocessing import OneHotEncoder
from xgboost import XGBRegressor
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import TimeSeriesSplit
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from sklearn.model_selection import train_test_split
In [ ]:
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
In [ ]:
import os
os.chdir('/content/drive/MyDrive/maverik-capstone') # changing the default directory

1.2 Loading Data¶

In [ ]:
qualitative_data_path = 'qualitative_data_msba.csv'
df_qdm = pd.read_csv(qualitative_data_path, index_col=0)

time_series_data_path = 'time_series_data_msba.csv'
df_tsdm = pd.read_csv(time_series_data_path, index_col=0)

1.3 Description of Time-Series Dataset

In [ ]:
df_tsdm.head()
Out[ ]:
capital_projects.soft_opening_date calendar.calendar_day_date calendar.fiscal_week_id_for_year calendar.day_of_week calendar_information.holiday calendar_information.type_of_day daily_yoy_ndt.total_inside_sales daily_yoy_ndt.total_food_service diesel unleaded site_id_msba
1 6/14/2022 6/17/2022 25 Friday NONE WEEKDAY 2168.2920 861.6930 722.7745 1425.1020 24535
2 6/14/2022 6/22/2022 25 Wednesday NONE WEEKDAY 2051.5635 808.0275 730.4850 1436.2740 24535
3 6/14/2022 6/23/2022 25 Thursday NONE WEEKDAY 2257.5000 966.4410 895.7970 1594.3725 24535
4 6/14/2022 6/26/2022 26 Sunday NONE WEEKEND 1520.5925 542.3250 584.2900 1124.9280 24535
5 6/14/2022 6/27/2022 26 Monday NONE WEEKDAY 1897.6930 771.4525 852.2605 1640.2540 24535
In [ ]:
df_tsdm.columns
Out[ ]:
Index(['capital_projects.soft_opening_date', 'calendar.calendar_day_date',
       'calendar.fiscal_week_id_for_year', 'calendar.day_of_week',
       'calendar_information.holiday', 'calendar_information.type_of_day',
       'daily_yoy_ndt.total_inside_sales', 'daily_yoy_ndt.total_food_service',
       'diesel', 'unleaded', 'site_id_msba'],
      dtype='object')
In [ ]:
df_tsdm.shape
Out[ ]:
(13908, 11)
In [ ]:
df_tsdm.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 13908 entries, 1 to 13908
Data columns (total 11 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   capital_projects.soft_opening_date  13908 non-null  object 
 1   calendar.calendar_day_date          13908 non-null  object 
 2   calendar.fiscal_week_id_for_year    13908 non-null  int64  
 3   calendar.day_of_week                13908 non-null  object 
 4   calendar_information.holiday        13908 non-null  object 
 5   calendar_information.type_of_day    13908 non-null  object 
 6   daily_yoy_ndt.total_inside_sales    13908 non-null  float64
 7   daily_yoy_ndt.total_food_service    13908 non-null  float64
 8   diesel                              13908 non-null  float64
 9   unleaded                            13908 non-null  float64
 10  site_id_msba                        13908 non-null  int64  
dtypes: float64(4), int64(2), object(5)
memory usage: 1.3+ MB
In [ ]:
df_tsdm.describe()
Out[ ]:
calendar.fiscal_week_id_for_year daily_yoy_ndt.total_inside_sales daily_yoy_ndt.total_food_service diesel unleaded site_id_msba
count 13908.000000 13908.000000 13908.000000 13908.000000 13908.000000 13908.000000
mean 26.501079 2846.537988 759.922326 1702.585227 2382.091588 23041.052632
std 14.998715 981.299870 341.578220 2161.208192 1025.518658 710.634218
min 1.000000 0.000000 0.000000 0.000000 240.180500 21560.000000
25% 14.000000 2181.156250 521.087875 383.062750 1654.149000 22540.000000
50% 26.000000 2693.976250 697.434500 1018.920000 2256.677500 22907.500000
75% 39.000000 3325.306250 924.282625 2283.297625 2928.254000 23555.000000
max 52.000000 7172.466000 2531.662000 20853.952000 8077.233500 24535.000000

1.4 Description of Qualitative Dataset

In [ ]:
df_qdm.head()
Out[ ]:
open_year square_feet front_door_count years_since_last_project parking_spaces lottery freal bonfire_grill pizza cinnabon ... rv_lanes_fueling_positions_2 hi_flow_rv_lanes_layout hi_flow_rv_lanes_stack_type non_24_hour self_check_out mens_toilet_count mens_urinal_count womens_toilet_count womens_sink_count site_id_msba
1 2021 5046 2 2 38 Yes Yes Yes No No ... 6 Stack HF/RV No Yes 2 2 6 2 21560
2 2021 5046 2 2 39 No Yes Yes Yes No ... 4 Combo HF/RV No Yes 5 5 10 4 21980
3 2021 5046 2 2 35 Yes Yes Yes Yes No ... 5 In-Line None No Yes 3 2 4 1 22015
4 2021 5046 2 2 36 No Yes Yes Yes No ... 4 Combo HF/RV No Yes 3 3 6 2 22085
5 2021 5046 2 2 25 Yes Yes Yes No No ... 0 NaN NaN No Yes 0 0 0 0 22120

5 rows × 54 columns

In [ ]:
df_qdm.columns
Out[ ]:
Index(['open_year', 'square_feet', 'front_door_count',
       'years_since_last_project', 'parking_spaces', 'lottery', 'freal',
       'bonfire_grill', 'pizza', 'cinnabon', 'godfather_s_pizza',
       'ethanol_free', 'diesel', 'hi_flow_lanes', 'rv_lanes',
       'hi_flow_rv_lanes', 'def', 'cat_scales', 'car_wash', 'ev_charging',
       'rv_dumps', 'propane', 'x1_mile_pop', 'x1_mile_emp', 'x1_mile_income',
       'x1_2_mile_pop', 'x1_2_mile_emp', 'x1_2_mile_income', 'x5_min_pop',
       'x5_min_emp', 'x5_min_inc', 'x7_min_pop', 'x7_min_emp', 'x7_min_inc',
       'traditional_forecourt_fueling_positions',
       'traditional_forecourt_layout', 'traditional_forecourt_stack_type',
       'rv_lanes_fueling_positions', 'rv_lanes_layout', 'rv_lanes_stack_type',
       'hi_flow_lanes_fueling_positions', 'hi_flow_lanes_layout',
       'hi_flow_lanes_stack_type', 'hi_flow_lanes_fueling_positions_2',
       'rv_lanes_fueling_positions_2', 'hi_flow_rv_lanes_layout',
       'hi_flow_rv_lanes_stack_type', 'non_24_hour', 'self_check_out',
       'mens_toilet_count', 'mens_urinal_count', 'womens_toilet_count',
       'womens_sink_count', 'site_id_msba'],
      dtype='object')
In [ ]:
df_qdm.shape
Out[ ]:
(37, 54)
In [ ]:
df_qdm.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 37 entries, 1 to 37
Data columns (total 54 columns):
 #   Column                                   Non-Null Count  Dtype 
---  ------                                   --------------  ----- 
 0   open_year                                37 non-null     int64 
 1   square_feet                              37 non-null     int64 
 2   front_door_count                         37 non-null     int64 
 3   years_since_last_project                 37 non-null     int64 
 4   parking_spaces                           37 non-null     int64 
 5   lottery                                  37 non-null     object
 6   freal                                    37 non-null     object
 7   bonfire_grill                            37 non-null     object
 8   pizza                                    37 non-null     object
 9   cinnabon                                 37 non-null     object
 10  godfather_s_pizza                        37 non-null     object
 11  ethanol_free                             37 non-null     object
 12  diesel                                   37 non-null     object
 13  hi_flow_lanes                            37 non-null     object
 14  rv_lanes                                 37 non-null     object
 15  hi_flow_rv_lanes                         37 non-null     object
 16  def                                      37 non-null     object
 17  cat_scales                               37 non-null     object
 18  car_wash                                 37 non-null     object
 19  ev_charging                              37 non-null     object
 20  rv_dumps                                 37 non-null     object
 21  propane                                  37 non-null     object
 22  x1_mile_pop                              37 non-null     int64 
 23  x1_mile_emp                              37 non-null     int64 
 24  x1_mile_income                           37 non-null     int64 
 25  x1_2_mile_pop                            37 non-null     int64 
 26  x1_2_mile_emp                            37 non-null     int64 
 27  x1_2_mile_income                         37 non-null     int64 
 28  x5_min_pop                               37 non-null     int64 
 29  x5_min_emp                               37 non-null     int64 
 30  x5_min_inc                               37 non-null     int64 
 31  x7_min_pop                               37 non-null     int64 
 32  x7_min_emp                               37 non-null     int64 
 33  x7_min_inc                               37 non-null     int64 
 34  traditional_forecourt_fueling_positions  37 non-null     int64 
 35  traditional_forecourt_layout             37 non-null     object
 36  traditional_forecourt_stack_type         37 non-null     object
 37  rv_lanes_fueling_positions               37 non-null     int64 
 38  rv_lanes_layout                          23 non-null     object
 39  rv_lanes_stack_type                      23 non-null     object
 40  hi_flow_lanes_fueling_positions          37 non-null     int64 
 41  hi_flow_lanes_layout                     22 non-null     object
 42  hi_flow_lanes_stack_type                 22 non-null     object
 43  hi_flow_lanes_fueling_positions_2        37 non-null     int64 
 44  rv_lanes_fueling_positions_2             37 non-null     int64 
 45  hi_flow_rv_lanes_layout                  23 non-null     object
 46  hi_flow_rv_lanes_stack_type              23 non-null     object
 47  non_24_hour                              37 non-null     object
 48  self_check_out                           37 non-null     object
 49  mens_toilet_count                        37 non-null     int64 
 50  mens_urinal_count                        37 non-null     int64 
 51  womens_toilet_count                      37 non-null     int64 
 52  womens_sink_count                        37 non-null     int64 
 53  site_id_msba                             37 non-null     int64 
dtypes: int64(27), object(27)
memory usage: 15.9+ KB
In [ ]:
df_qdm.describe()
Out[ ]:
open_year square_feet front_door_count years_since_last_project parking_spaces x1_mile_pop x1_mile_emp x1_mile_income x1_2_mile_pop x1_2_mile_emp ... traditional_forecourt_fueling_positions rv_lanes_fueling_positions hi_flow_lanes_fueling_positions hi_flow_lanes_fueling_positions_2 rv_lanes_fueling_positions_2 mens_toilet_count mens_urinal_count womens_toilet_count womens_sink_count site_id_msba
count 37.000000 37.00000 37.0 37.000000 37.000000 37.000000 37.000000 37.000000 37.000000 37.000000 ... 37.000000 37.000000 37.000000 37.000000 37.000000 37.000000 37.000000 37.000000 37.000000 37.000000
mean 2021.324324 4970.27027 2.0 1.648649 37.405405 6703.567568 4757.648649 53300.378378 1833.108108 1514.135135 ... 14.270270 2.513514 3.324324 3.324324 2.513514 2.378378 2.351351 4.648649 1.702703 23040.405405
std 0.474579 575.93121 0.0 0.483978 5.918237 5694.011350 4697.168291 24333.027254 1915.140476 2489.423094 ... 3.948619 2.049683 2.925501 2.925501 2.049683 0.923500 0.856875 1.751447 0.740303 730.069801
min 2021.000000 2933.00000 2.0 1.000000 23.000000 0.000000 56.000000 0.000000 0.000000 31.000000 ... 10.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 21560.000000
25% 2021.000000 5046.00000 2.0 1.000000 34.000000 1984.000000 1771.000000 39538.000000 262.000000 386.000000 ... 12.000000 0.000000 0.000000 0.000000 0.000000 2.000000 2.000000 4.000000 1.000000 22540.000000
50% 2021.000000 5046.00000 2.0 2.000000 38.000000 5574.000000 3895.000000 46356.000000 1003.000000 1037.000000 ... 12.000000 4.000000 5.000000 5.000000 4.000000 2.000000 2.000000 4.000000 2.000000 22890.000000
75% 2022.000000 5046.00000 2.0 2.000000 41.000000 11269.000000 6002.000000 73519.000000 2686.000000 1616.000000 ... 16.000000 4.000000 5.000000 5.000000 4.000000 3.000000 3.000000 6.000000 2.000000 23555.000000
max 2022.000000 6134.00000 2.0 2.000000 49.000000 18692.000000 26077.000000 110957.000000 5923.000000 15403.000000 ... 24.000000 6.000000 9.000000 9.000000 6.000000 5.000000 5.000000 10.000000 4.000000 24535.000000

8 rows × 27 columns

1.5 Missing Values

In [ ]:
# Calculate the number of missing values for each column in the qualitative DataFrame
missing_qualitative = df_qdm.isnull().sum()

# Calculate the number of missing values for each column in the time series DataFrame
missing_time_series = df_tsdm.isnull().sum()
In [ ]:
# Show columns from the qualitative DataFrame that have any missing values
missing_qualitative[missing_qualitative > 0]
Out[ ]:
rv_lanes_layout                14
rv_lanes_stack_type            14
hi_flow_lanes_layout           15
hi_flow_lanes_stack_type       15
hi_flow_rv_lanes_layout        14
hi_flow_rv_lanes_stack_type    14
dtype: int64
In [ ]:
# Check unique values in columns with missing values in qualitative DataFrame
print('rv_lanes_layout Unique values: ', df_qdm['rv_lanes_layout'].unique())
print('rv_lanes_stack_type Unique values: ', df_qdm['rv_lanes_stack_type'].unique())
print('hi_flow_lanes_layout Unique values: ', df_qdm['hi_flow_lanes_layout'].unique())
print('hi_flow_lanes_stack_type Unique values: ', df_qdm['hi_flow_lanes_stack_type'].unique())
print('hi_flow_rv_lanes_layout Unique values: ', df_qdm['hi_flow_rv_lanes_layout'].unique())
print('hi_flow_rv_lanes_stack_type Unique values: ', df_qdm['hi_flow_rv_lanes_stack_type'].unique())
rv_lanes_layout Unique values:  ['Stack' 'In-Line' nan]
rv_lanes_stack_type Unique values:  ['HF/RV' 'None' nan]
hi_flow_lanes_layout Unique values:  ['Stack' 'Combo' nan]
hi_flow_lanes_stack_type Unique values:  ['HF/RV' nan]
hi_flow_rv_lanes_layout Unique values:  ['Stack' 'Combo' 'In-Line' nan]
hi_flow_rv_lanes_stack_type Unique values:  ['HF/RV' 'None' nan]
In [ ]:
# Fill all Missing values in the qualitative DataFrame with a new category 'None'
df_qdm = df_qdm.fillna('None')
In [ ]:
# Show columns from the qualitative DataFrame that have any missing values
missing_time_series[missing_time_series > 0]
Out[ ]:
Series([], dtype: int64)
  1. Since the missing values in the qualitative DataFrame are mostly categorical and can be interpreted as not present. We are imputing a new 'None' Category in place of null values.
  2. There are no missing values in the time-series DataFrame

1.6 Merging Datasets

In [ ]:
merged_df = pd.merge(df_tsdm, df_qdm, on='site_id_msba', how='inner')

1.7 Datetime features

In [ ]:
# Convert date columns to datetime format
merged_df['calendar.calendar_day_date'] = pd.to_datetime(merged_df['calendar.calendar_day_date'])
merged_df['capital_projects.soft_opening_date'] = pd.to_datetime(merged_df['capital_projects.soft_opening_date'])

# Sort Values by site id and calendar day date
merged_df = merged_df.sort_values(['site_id_msba', 'calendar.calendar_day_date'])

# Create new fetaures based on the existing date features
merged_df['day_of_week'] = merged_df['calendar.calendar_day_date'].dt.dayofweek
merged_df['calendar_year'] = merged_df['calendar.calendar_day_date'].dt.year
merged_df['calendar_month'] = merged_df['calendar.calendar_day_date'].dt.month
merged_df['calendar_day'] = merged_df['calendar.calendar_day_date'].dt.day

1.8 Outliers Detection and Treatment

In [ ]:
# Function to detect outliers
def detect_outliers_iqr(data, feature):
    Q1 = data[feature].quantile(0.25)
    Q3 = data[feature].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    outliers = data[(data[feature] < lower_bound) | (data[feature] > upper_bound)]
    return outliers, lower_bound, upper_bound

target_features = ["daily_yoy_ndt.total_inside_sales", "daily_yoy_ndt.total_food_service", "diesel_x", "unleaded"]
outliers_info = {}

# Detecting outliers for each target feature
for feature in target_features:
    outliers, lower_bound, upper_bound = detect_outliers_iqr(merged_df, feature)
    outliers_info[feature] = {
        "outliers_count": outliers.shape[0],
        "lower_bound": lower_bound,
        "upper_bound": upper_bound,
        "outliers_index": outliers.index
    }

# Outliers information for each target variable
outliers_summary = pd.DataFrame(outliers_info).T[['outliers_count', 'lower_bound', 'upper_bound']]
outliers_summary
Out[ ]:
outliers_count lower_bound upper_bound
daily_yoy_ndt.total_inside_sales 490 451.423875 5068.637875
daily_yoy_ndt.total_food_service 517 -84.042 1541.967
diesel_x 640 -2480.291625 5198.477375
unleaded 326 -304.937062 4904.032438
In [ ]:
# Capping the outliers to the upper and lower bounds for each target feature
for feature in target_features:
    lower_bound = outliers_info[feature]['lower_bound']
    upper_bound = outliers_info[feature]['upper_bound']

    merged_df[feature] = merged_df[feature].clip(lower=lower_bound, upper=upper_bound)

# Check the data to ensure the capping was applied
merged_df_describe = merged_df[target_features].describe()

merged_df_describe
Out[ ]:
daily_yoy_ndt.total_inside_sales daily_yoy_ndt.total_food_service diesel_x unleaded
count 13542.000000 13542.000000 13542.000000 13542.000000
mean 2831.571136 756.807500 1557.357496 2371.155283
std 924.481068 313.037819 1410.406522 978.457952
min 451.423875 0.000000 12.498500 240.180500
25% 2182.879125 525.711375 399.246750 1648.426500
50% 2694.347250 705.759250 1083.677000 2263.154250
75% 3337.182625 932.213625 2318.939000 2950.668875
max 5068.637875 1541.967000 5198.477375 4904.032438

A considerable number of outliers were found for each of the target variables when the IQR method's 1.5 multiplier was used for outlier detection. This finding raises the possibility of either a significant variation in the data or a high sensitivity of this method to the features of the dataset. With a large range between the lower and upper bounds, the descriptive statistics performed prior to capping suggested that these outliers might have an impact on the overall profile of the dataset. Following capping, the dataset was shown to have less variability and possibly more stability for further analysis based on the lower standard deviation and the alignment of maximum values with the upper bounds. During the modeling process, we can determine if there is a significant effect of outliers on the residuals.

1.9 Stationarity Test

In [ ]:
# Function to perform the Augmented Dickey-Fuller test
def adf_test(series, feature_name):
    result = adfuller(series, autolag='AIC')
    adf_statistic = result[0]
    p_value = result[1]
    critical_values = result[4]
    return {
        "Feature": feature_name,
        "ADF Statistic": adf_statistic,
        "p-value": p_value,
        "Critical Values": critical_values,
        "Stationarity": p_value < 0.05  # If p-value < 0.05, we reject the null hypothesis (series is stationary)
    }

# Applying the ADF test on the target features
adf_results = pd.DataFrame([adf_test(merged_df[feature], feature) for feature in target_features])
adf_results[['Feature', 'ADF Statistic', 'p-value', 'Stationarity']]
Out[ ]:
Feature ADF Statistic p-value Stationarity
0 daily_yoy_ndt.total_inside_sales -5.972455 1.921862e-07 True
1 daily_yoy_ndt.total_food_service -4.959112 2.672214e-05 True
2 diesel_x -5.228578 7.684673e-06 True
3 unleaded -5.464509 2.476066e-06 True

All variables show evidence of stationarity, according to the results of the Augmented Dickey-Fuller(ADF) test applied to the target features, as each variable's test statistics are below the critical values and its p-value is significantly below the 0.05 threshold.

This implies that these variables time series data do not have unit roots and are stable over time, indicating that their statistical characteristics such as variance and mean are not time-dependent.

1.10 Autocorrelation Plots

In [ ]:
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt

# Funtion to plot Autocorrelation plots
def plot_auto_correlation_plots(target):
# Create subplots with 1 row and 2 columns
  fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(14, 4))

  # Plot Autocorrelation Function (ACF)
  plot_acf(merged_df[target], ax=axes[0])
  axes[0].set_title(f'Autocorrelation Function for {target}')
  axes[0].set_xlabel('Lags')  # Set x-axis label
  axes[0].set_ylabel('Partial Autocorrelation')

  # Plot Partial Autocorrelation Function (PACF)
  plot_pacf(merged_df[target], ax=axes[1])
  axes[1].set_title(f'Partial Autocorrelation Function for {target}')
  axes[1].set_xlabel('Lags')  # Set x-axis label
  axes[1].set_ylabel('Partial Autocorrelation')

  # Show the plots
  plt.tight_layout()
  plt.show()

# Iterate through the target features to plot the autocorrelation plots
target_features = ["daily_yoy_ndt.total_inside_sales", "daily_yoy_ndt.total_food_service", "diesel_x", "unleaded"]
for target in target_features:
  plot_auto_correlation_plots(target)

The ACF plots for both "daily_yoy_ndt.total_inside_sales" and "daily_yoy_ndt.total_food_service" indicate a strong positive correlation at lag 1 that gradually declines, yet remains significant across many lags, suggesting non-stationarity and a potential trend or seasonal pattern in the data. The PACF plot for "total_inside_sales" shows a significant spike at lag 1 and then cuts off, whereas the "total_food_service" PACF plot suggests a less clear AR structure with significant lags occurring at various intervals. Both datasets are likely to benefit from differencing to remove the non-stationary components and may require seasonal adjustment given the repeating patterns in the ACF. The autocorrelation function (ACF) plots for both "diesel_x" and "unleaded" show a very gradual decline in the correlations as lags increase, which implies a strong persistent autocorrelation over time, indicative of a non-stationary time series.

1.11 Encoding Categorical variables

In [ ]:
# Create a list of categorical columns
categorical_columns = list(merged_df.select_dtypes(include=['object']).columns)
In [ ]:
# One-Hot Encoding for categorical variables
one_hot_encoder = OneHotEncoder(sparse=False, drop='first')
encoded_categorical_columns = pd.DataFrame(
    one_hot_encoder.fit_transform(merged_df[categorical_columns]),
    index=merged_df.index
)

# Join the encoded columns back to the dataframe
merged_df = merged_df.join(encoded_categorical_columns).drop(columns=categorical_columns)

# Display head for encoded DataFrame
merged_df.head()
Out[ ]:
capital_projects.soft_opening_date calendar.calendar_day_date calendar.fiscal_week_id_for_year daily_yoy_ndt.total_inside_sales daily_yoy_ndt.total_food_service diesel_x unleaded site_id_msba open_year square_feet ... 48 49 50 51 52 53 54 55 56 57
13447 2021-01-12 2021-01-12 2 2036.2685 762.8530 1424.1850 1522.0030 21560 2021 5046 ... 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0
13448 2021-01-12 2021-01-13 2 2379.8880 1003.7930 1303.8445 1853.7715 21560 2021 5046 ... 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0
13449 2021-01-12 2021-01-14 2 2435.8600 974.2250 1375.6785 2122.4070 21560 2021 5046 ... 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0
13359 2021-01-12 2021-01-15 3 2805.9780 911.0115 1334.9175 2267.9930 21560 2021 5046 ... 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0
13450 2021-01-12 2021-01-16 3 2314.7635 715.7535 831.1625 1819.6395 21560 2021 5046 ... 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0

5 rows × 96 columns

1.12 Splitting Data into Train and Test

In [ ]:
# Get a list of unique stores
unique_stores_list = merged_df['site_id_msba'].unique()

# Split the stores into training and testing sets
train_stores, test_stores = train_test_split(unique_stores_list, test_size=5, random_state=42)  # Using a fixed random state for reproducibility

# Split the data into training and testing based on the stores
train_data_og = merged_df[merged_df['site_id_msba'].isin(train_stores)]
test_data_og = merged_df[merged_df['site_id_msba'].isin(test_stores)]


# Let's check the number of entries in the training and testing sets
len_train = len(train_data_og)
len_test = len(test_data_og)
(len_train, len_test)
Out[ ]:
(11712, 1830)

1.13 Creating Lag and Rolling features

Lag and rolling features are commonly used in time-series analysis and forecasting. These features involve using historical data to generate new features that can provide insights into trends, patterns, or relationships within a dataset

In [ ]:
def create_lag_features(target_variable):
  # Creating lagged features for up to 7 days
  train_data = train_data_og.copy()
  test_data = test_data_og.copy()
  for lag in range(1, 8):
      train_data[f"{target_variable}_lag_{lag}"] = train_data[target_variable].shift(lag)
      test_data[f"{target_variable}_lag_{lag}"] = test_data[target_variable].shift(lag)

  # Creating rolling window features (7 days and 30 days)
  train_data[f'{target_variable}_rolling_7_mean'] = train_data[target_variable].rolling(window=7).mean()
  train_data[f'{target_variable}_rolling_30_mean'] = train_data[target_variable].rolling(window=30).mean()
  train_data[f'{target_variable}_rolling_7_std'] = train_data[target_variable].rolling(window=7).std()
  train_data[f'{target_variable}_rolling_30_std'] = train_data[target_variable].rolling(window=30).std()

  test_data[f'{target_variable}_rolling_7_mean'] = test_data[target_variable].rolling(window=7).mean()
  test_data[f'{target_variable}_rolling_30_mean'] = test_data[target_variable].rolling(window=30).mean()
  test_data[f'{target_variable}_rolling_7_std'] = test_data[target_variable].rolling(window=7).std()
  test_data[f'{target_variable}_rolling_30_std'] = test_data[target_variable].rolling(window=30).std()

  # The first few rows will have NaN values due to lagging and rolling, so we drop them.
  train_data.dropna(inplace=True)
  test_data.dropna(inplace=True)

  return train_data, test_data
In [ ]:
pd.set_option('display.max_colwidth', None)

The create_lag_features function generates lag and rolling features for a given target variable in time-series data. It starts by creating lagged features for the target variable spanning up to 7 days, both for the training and test datasets. Following that, rolling window features are computed, including the 7-day and 30-day rolling mean and standard deviation. This process enhances the dataset by capturing temporal patterns and trends within the time series. However, as lag and rolling operations introduce NaN values in the initial rows, the function concludes by removing these rows from both the training and test datasets. Overall, this function is designed to prepare the data for time-series analysis or forecasting by incorporating valuable historical information through lag and rolling window features.

In the previous section, we have created train_data_og and test_data_og by performing a train_test_split on the merged_df based on the site_id_msba. This means that, the train_data_og and test_data_og do not contain the same stores. For forecasting, we are again performing the train_test_split again on the train_data_og and training the model and testing on this split. Finally, we are using the 5 stores in the test_data_og as a validation set and forecasting their possible 1 year sales.¶

2.0 Model Selection

In our quest for forecasting sales of various stores, we systematically evaluated a diverse array of candidates, encompassing XGBoost, Linear Regression, Random Forest, SARIMA (Seasonal AutoRegressive Integrated Moving Average), and Prophet. Each of these models was chosen for its unique strengths and capabilities, contributing to a comprehensive exploration of potential solutions for our predictive analytics task.

XGBoost, a robust and flexible ensemble learning algorithm, was considered for its ability to handle complex relationships within the data and deliver accurate predictions. Linear Regression provided a fundamental baseline model, aiding in the interpretation of linear relationships within our dataset. Random Forest, known for its ensemble of decision trees, offered versatility and resilience to outliers, further enriching our model portfolio. SARIMA, as a dedicated time series model, was incorporated to capture potential temporal patterns and dependencies present in the data. Prophet, a specialized forecasting tool developed by Facebook, was included for its adeptness in handling time series data with strong seasonal components.

The final selection of the best model was based on a holistic evaluation of cross-validated performance metrics. Criteria such as MAE (Mean Absolute Error), MSE (Mean Squared Error), RMSE (Root Mean Squared Error), and R2 (R-squared) were considered to identify the model that demonstrated superior predictive capabilities on both training and validation sets. The chosen model represents a balanced solution, leveraging the strengths of each candidate model within the context of our specific predictive analytics objectives.

2.1 XGBOOST Model

In [ ]:
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Function to train XGBoost model and make predictions
def train_and_forecast(train_data, test_data, target_variable):
    # Prepare features and target
    X_train = train_data.drop(columns=['daily_yoy_ndt.total_inside_sales',
                              'daily_yoy_ndt.total_food_service',
                              'diesel_x',
                              'unleaded',
                              'site_id_msba',
                              'calendar.calendar_day_date',
                              'capital_projects.soft_opening_date'],axis=1)
    y_train = train_data[target_variable]
    X_test = test_data.drop(columns=['daily_yoy_ndt.total_inside_sales',
                              'daily_yoy_ndt.total_food_service',
                              'diesel_x',
                              'unleaded',
                              'site_id_msba',
                              'calendar.calendar_day_date',
                              'capital_projects.soft_opening_date'],axis=1)
    y_test = test_data[target_variable]

    # Initialize the XGBoost regressor
    model = XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)

    # Train the model using the training data
    model.fit(X_train, y_train)

    # Predict on the training data and the test data
    train_predictions = model.predict(X_train)
    test_predictions = model.predict(X_test)

    # Calculate the performance metrics
    train_rmse = np.sqrt(mean_squared_error(y_train, train_predictions))
    test_rmse = np.sqrt(mean_squared_error(y_test, test_predictions))
    train_r2 = r2_score(y_train, train_predictions)
    test_r2 = r2_score(y_test, test_predictions)

    # Create a DataFrame for actual vs predicted values for the test set
    predictions_df = pd.DataFrame({
        'Actual': y_test,
        'Predicted': test_predictions
    })

    return model, predictions_df, (train_rmse, test_rmse, train_r2, test_r2)

# Forecast function to create additional features for the future dates and make predictions
def forecast_future(model, data, target_variable, periods=365):
    # Forecast for each store
    for store_id in data['site_id_msba'].unique():
        store_data = data[data['site_id_msba'] == store_id].copy()
        actual_dates = store_data['calendar.calendar_day_date']
        actual_store_data = store_data.copy()
        store_data.set_index('calendar.calendar_day_date', inplace=True,  drop = False)
        last_date = store_data['calendar.calendar_day_date'].max()
        future_dates = pd.date_range(start=last_date, periods=periods + 1, closed='right')
        future_data = pd.DataFrame(index=future_dates)
        forecasts = pd.DataFrame(index=future_dates)
        for date in future_dates:
            if date not in store_data.index:
              store_data = store_data.append(pd.Series(name=date))


            # Creating features for the date
            store_data['day_of_week'] = date.dayofweek
            store_data['calendar_year'] = date.year
            store_data['calendar_month'] = date.month
            store_data['calendar_day'] = date.day


            # Create lag and rolling features based on previous data
            for lag in range(1, 8):
                store_data[f"{target_variable}_lag_{lag}"] = store_data[target_variable].shift(lag)
            store_data[f'{target_variable}_rolling_7_mean'] = store_data[target_variable].rolling(window=7).mean()
            store_data[f'{target_variable}_rolling_30_mean'] = store_data[target_variable].rolling(window=30).mean()
            store_data[f'{target_variable}_rolling_7_std'] = store_data[target_variable].rolling(window=7).std()
            store_data[f'{target_variable}_rolling_30_std'] = store_data[target_variable].rolling(window=30).std()

            # Drop the rows with NaN values created by shifting and rolling
            if len(store_data.dropna()) >= 7:
              store_data.dropna(inplace=True)

            if not store_data.empty:
            # Prepare the last row (most recent data) as input for prediction
              X_future = store_data.drop(columns=['daily_yoy_ndt.total_inside_sales',
                                'daily_yoy_ndt.total_food_service',
                                'diesel_x',
                                'unleaded',
                                'site_id_msba',
                                'calendar.calendar_day_date',
                                'capital_projects.soft_opening_date'], axis=1).iloc[-1].to_frame().transpose()
              # Predict the target variable for the current date
              prediction = model.predict(X_future)[0]
              forecasts.loc[date, store_id] = prediction

              # Update the target variable with the prediction for the next prediction step
              store_data.loc[date, target_variable] = prediction


        forecasts['calendar.calendar_day_date'] = forecasts.index
        dates = actual_store_data['calendar.calendar_day_date'].append(forecasts['calendar.calendar_day_date'] )
        values = pd.concat([actual_store_data[target_variable], forecasts[store_id]])

        plt.figure(figsize=(14, 7))

        # Plot actual past data
        plt.plot(actual_store_data['calendar.calendar_day_date'], actual_store_data[target_variable], label='Actual', color='blue')

        # Find the forecast start date index for the plotting
        actual_store_data.set_index('calendar.calendar_day_date', inplace=True)
        forecast_start_index = actual_store_data.index.get_loc(last_date) + 1

        # Plot future forecasts
        plt.plot(dates[forecast_start_index:], values[forecast_start_index:], label='Forecast', color='orange', linestyle='--')

        plt.title(f'{target_variable} Forecast for Store {store_id}')
        plt.xlabel('Date')
        plt.ylabel('Sales')
        plt.legend()
        plt.grid(True)
        plt.show()


        forecasts.to_csv(f'/content/{store_id}_out.csv')
    return forecasts

The above script leverages the XGBoost machine learning model for time-series forecasting. The train_and_forecast function first trains the XGBoost model using historical training data, evaluating its performance on both the training and test sets. This function returns the trained model, a DataFrame comparing actual and predicted values, and key performance metrics such as root mean squared error (RMSE) and R-squared (R2).

The second function, forecast_future, utilizes the trained model to generate forecasts for future dates. It iterates over each store, creating features for future dates based on lag and rolling window operations. The predicted values are then incorporated into the dataset for subsequent predictions. The script concludes by visualizing the forecasted sales alongside actual data for each store, saving the forecasts to CSV files, and providing insights into future sales trends for decision-making purposes.

2.2 Forecasting Target Variables

2.21 Target Variable - Total Inside Sales

In [ ]:
# Train the model and forecast for the "total_inside_sales" target variable
target_variable = "daily_yoy_ndt.total_inside_sales"
train_data_lagged, test_data_lagged = create_lag_features(target_variable)
model, predictions_df, metrics = train_and_forecast(train_data_lagged, test_data_lagged, target_variable)

# Forecast the future for each store
forecasts = forecast_future(model, test_data_lagged, target_variable)

# Output the performance metrics and a snippet of the forecasts with metric names
train_rmse, test_rmse, train_r2, test_r2 = metrics
print(f"Train Root Mean Squared Error (RMSE): {train_rmse}")
print(f"Test Root Mean Squared Error (RMSE): {test_rmse}")
print(f"Train R-squared (R2): {train_r2}")
print(f"Test R-squared (R2): {test_r2}")

forecasts.head()
Train Root Mean Squared Error (RMSE): 159.0069894762768
Test Root Mean Squared Error (RMSE): 275.9534972545057
Train R-squared (R2): 0.9713151651872407
Test R-squared (R2): 0.8807128859604085
Out[ ]:
24255 calendar.calendar_day_date
2023-08-10 2248.239990 2023-08-10
2023-08-11 2316.237793 2023-08-11
2023-08-12 2121.083496 2023-08-12
2023-08-13 1924.720337 2023-08-13
2023-08-14 2329.705322 2023-08-14

The forecasting model employed for predicting 'daily_yoy_ndt.total_inside_sales' across various stores showcases a commendable R2 score of 0.88, signifying that the model is capable of explaining a significant portion of the variance in the sales data. With a RMSE of 275.95, the predictions are in a reasonable range from the actual sales figures, although the train R2 and RMSE are higher than the test, possibly due to outliers, overfitting or unaccounted factors in the sales trends. Looking at the Forecast plots the model has difficulty capturing extreme values or spikes in sales, which could be due to many factors, including possible outliers, non-linear trends, or irregular patterns in the sales data that the model cannot account for.

2.22 Target Variable - Total Food Service Sales

In [ ]:
# Train the model and forecast for the "total_food_service" target variable
target_variable = "daily_yoy_ndt.total_food_service"
train_data_lagged, test_data_lagged = create_lag_features(target_variable)
model, predictions_df, metrics = train_and_forecast(train_data_lagged, test_data_lagged, target_variable)

# Forecast the future for each store
forecasts = forecast_future(model, test_data_lagged, target_variable)

# Output the performance metrics and a snippet of the forecasts with metric names
train_rmse, test_rmse, train_r2, test_r2 = metrics
print(f"Train Root Mean Squared Error (RMSE): {train_rmse}")
print(f"Test Root Mean Squared Error (RMSE): {test_rmse}")
print(f"Train R-squared (R2): {train_r2}")
print(f"Test R-squared (R2): {test_r2}")

forecasts.head()
Train Root Mean Squared Error (RMSE): 51.945964715763616
Test Root Mean Squared Error (RMSE): 99.59938904171811
Train R-squared (R2): 0.9744054859512813
Test R-squared (R2): 0.7895828421958543
Out[ ]:
24255 calendar.calendar_day_date
2023-08-10 793.250610 2023-08-10
2023-08-11 795.345215 2023-08-11
2023-08-12 662.977905 2023-08-12
2023-08-13 615.353699 2023-08-13
2023-08-14 770.036804 2023-08-14

The performance metrics from the forecasting model for 'daily_yoy_ndt.total_food_service' suggest a model with reasonable predictive accuracy, as evidenced by an R2 score of 0.78, indicating that the model accounts for a substantial amount of variability in the food service sales data. The RMSE of 99.59 points to there is possibility of the model overfitting, given that the R2 score is very high in the train data and drops significantly in the test data. The plots reinforce that the model is not capturing the full range of variability in the sales data, particularly the peaks and valleys.

2.23 Target Variable - Diesel Sales

In [ ]:
# Train the model and forecast for the "diesel" target variable
target_variable = "diesel_x"
train_data_lagged, test_data_lagged = create_lag_features(target_variable)
model, predictions_df, metrics = train_and_forecast(train_data_lagged, test_data_lagged, target_variable)

# Forecast the future for each store
forecasts = forecast_future(model, test_data_lagged, target_variable)

# Output the performance metrics and a snippet of the forecasts with metric names
train_rmse, test_rmse, train_r2, test_r2 = metrics
print(f"Train Root Mean Squared Error (RMSE): {train_rmse}")
print(f"Test Root Mean Squared Error (RMSE): {test_rmse}")
print(f"Train R-squared (R2): {train_r2}")
print(f"Test R-squared (R2): {test_r2}")

forecasts.head()
Train Root Mean Squared Error (RMSE): 156.02656293069364
Test Root Mean Squared Error (RMSE): 1381.8849593065484
Train R-squared (R2): 0.9869627683251245
Test R-squared (R2): 0.246121573472739
Out[ ]:
24255 calendar.calendar_day_date
2023-08-10 3157.135010 2023-08-10
2023-08-11 2913.590820 2023-08-11
2023-08-12 2451.014404 2023-08-12
2023-08-13 2374.418457 2023-08-13
2023-08-14 3065.634277 2023-08-14

The forecasting model for 'diesel_x' displays a poor R2 score of 0.24, indicating that the model's ability to predict diesel sales is weaker compared to the previous two targets. The RMSE is relatively high at 1381.88, suggesting that the model's sales predictions deviate substantially from the actual figures for the stores. The vast difference in RMSE among the stores indicates that the model struggles with consistency across different locations, likely due to location-specific factors that affect diesel sales and are not captured by the model. The forecast seems to predict constant sales, which does not reflect the actual sales trend. The model is likely overfitting to the training data and not capturing the underlying sales patterns effectively. This is evident from the high RMSE and low R2 on the test set, along with the discrepancy between the actual and forecasted sales in the plots.

2.24 Target Variable - Unleaded Sales

In [ ]:
# Train the model and forecast for the "unleaded" target variable
target_variable = "unleaded"
train_data_lagged, test_data_lagged = create_lag_features(target_variable)
model, predictions_df, metrics = train_and_forecast(train_data_lagged, test_data_lagged, target_variable)

# Forecast the future for each store
forecasts = forecast_future(model, test_data_lagged, target_variable)

# Output the performance metrics and a snippet of the forecasts with metric names
train_rmse, test_rmse, train_r2, test_r2 = metrics
print(f"Train Root Mean Squared Error (RMSE): {train_rmse}")
print(f"Test Root Mean Squared Error (RMSE): {test_rmse}")
print(f"Train R-squared (R2): {train_r2}")
print(f"Test R-squared (R2): {test_r2}")

forecasts.head()
Train Root Mean Squared Error (RMSE): 165.74368076600695
Test Root Mean Squared Error (RMSE): 313.03822722415134
Train R-squared (R2): 0.9722252196888015
Test R-squared (R2): 0.8697455429007126
Out[ ]:
24255 calendar.calendar_day_date
2023-08-10 1625.639526 2023-08-10
2023-08-11 1654.871582 2023-08-11
2023-08-12 1370.459106 2023-08-12
2023-08-13 1340.163086 2023-08-13
2023-08-14 1605.919678 2023-08-14

The forecasting model's performance on 'unleaded' fuel sales reveals a mixed picture with an R2 score of 0.97, suggesting a fairly good fit where the model is able to explain a majority of the variance in sales. However, the RMSE of 313.53 points to considerable prediction errors across the dataset. The model still fails to capture the high variability and the spikes in the sales data which points to overfitting.

Actual vs Predicted for the Target Variables¶

In [ ]:
# OG Define a function to prepare the dataset for training
def prepare_training_data(data, store_ids, target_variable):
    # Filter the data for the specified stores
    data_filtered = data[data['site_id_msba'].isin(store_ids)]

    # Keep only the first year of data for each store
    data_first_year = data_filtered[data_filtered['calendar.calendar_day_date'] < data_filtered['capital_projects.soft_opening_date'] + pd.DateOffset(years=1)]

    # Drop the target variable and other non-feature columns
    X = data_first_year.drop(['daily_yoy_ndt.total_inside_sales',
                              'daily_yoy_ndt.total_food_service',
                              'diesel_x',
                              'unleaded',
                              'site_id_msba',
                              'calendar.calendar_day_date',
                              'capital_projects.soft_opening_date'], axis=1)

    # Target variable
    y = data_first_year[target_variable]

    # One-hot encode the categorical variables
    X = pd.get_dummies(X, drop_first=True)

    return X, y

# This function will prepare data for predictions in a similar way as the training data
def prepare_predict_data(data, store_id, target_variable):
    # Filter data for the specific store
    store_data = data[data['site_id_msba'] == store_id]

    # Ensure we use only the first year of data for prediction
    store_data_first_year = store_data[store_data['calendar.calendar_day_date'] < store_data['capital_projects.soft_opening_date'] + pd.DateOffset(years=1)]

    # Prepare features
    X = store_data_first_year.drop(['daily_yoy_ndt.total_inside_sales',
                                    'daily_yoy_ndt.total_food_service',
                                    'diesel_x',
                                    'unleaded',
                                    'site_id_msba',
                                    'calendar.calendar_day_date',
                                    'capital_projects.soft_opening_date'], axis=1)

    # One-hot encode the categorical variables
    X = pd.get_dummies(X, drop_first=True)

    return X, store_data_first_year[target_variable], store_data_first_year['calendar.calendar_day_date']

def forecast_model(target_variable):
  train_data, test_data = create_lag_features(target_variable)
  # Prepare the training data
  X_train, y_train = prepare_training_data(train_data, train_stores, target_variable)

  # Initialize the XGBRegressor
  if target_variable in ['diesel_x', 'unleaded']:
    model = XGBRegressor(n_estimators=1000, learning_rate=0.1, random_state=42, max_depth=3, subsample=1,  colsample_bytree=1)
  else:
    model = XGBRegressor(n_estimators=500, learning_rate=0.05, random_state=42)


  # Train the model on the combined dataset
  model.fit(X_train, y_train)

  # Dictionary to store the evaluation metrics for each store
  store_metrics = {}

  # Predict and evaluate for each of the prediction stores
  for store_id in test_stores:
      X_predict, y_true, dates = prepare_predict_data(test_data, store_id, target_variable)
      y_pred = model.predict(X_predict)
      mse = mean_squared_error(y_true, y_pred)
      rmse = mean_squared_error(y_true, y_pred, squared=False)
      r2 = r2_score(y_true, y_pred)
      store_metrics[store_id] = {'MSE': mse, 'RMSE': rmse, 'R2':r2}

      # Line plot to visualize the actual sales vs predicted sales
      plt.figure(figsize=(10, 4))
      plt.plot(dates, y_true, label='Actual')
      plt.plot(dates, y_pred, label='Predicted')
      plt.title(f'Actual vs Predicted Sales for {target_variable}')
      plt.xlabel('Date')
      plt.ylabel('Sales')
      plt.legend()
      plt.show()


  store_metrics_df = pd.DataFrame(store_metrics).T
  store_metrics_df.reset_index(inplace=True)
  store_metrics_df.rename(columns={'index': 'Store'}, inplace=True)

  # Print the Evaluation Metrics DataFrame
  print(store_metrics_df)

  # Visualizing the Evaluation metrics
  store_metrics_df['MSE'].plot(kind='bar', color='skyblue')
  plt.title('MSE for Each Store')
  plt.ylabel('MSE')
  plt.xlabel('Store')
  plt.show()

  store_metrics_df['RMSE'].plot(kind='bar', color='lightgreen')
  plt.title('RMSE for Each Store')
  plt.ylabel('RMSE')
  plt.xlabel('Store')
  plt.show()

  store_metrics_df['R2'].plot(kind='bar', color='salmon')
  plt.title('R2 for Each Store')
  plt.ylabel('R2')
  plt.xlabel('Store')
  plt.show()

  # Calulating average of the evaluation metrics of each predicted stores
  r2_sum = 0
  mse_sum = 0
  rmse_sum = 0
  for k, v in store_metrics.items():
    r2_sum+=v['R2']
    mse_sum+=v['MSE']
    rmse_sum+=v['RMSE']

  avg_r2 = r2_sum/len(store_metrics)
  avg_mse = mse_sum/len(store_metrics)
  avg_rmse = rmse_sum/len(store_metrics)

  print(" ")
  print(f'Average MSE:{avg_mse}, Average RMSE:{avg_rmse}, Average R2: {avg_r2}')
In [ ]:
# 'daily_yoy_ndt.total_inside_sales', 'daily_yoy_ndt.total_food_service', 'diesel_x', 'unleaded',
forecast_model('daily_yoy_ndt.total_inside_sales')
   Store            MSE        RMSE        R2
0  22855   55866.289154  236.360507  0.844885
1  22715   75070.442966  273.989859  0.853739
2  22120   33844.969969  183.970025  0.871654
3  23730   49058.454730  221.491433  0.917951
4  24255  122291.776472  349.702411  0.770049
 
Average MSE:67226.38665832648, Average RMSE:253.10284691658526, Average R2: 0.8516555612832315

The forecasting model employed for predicting 'daily_yoy_ndt.total_inside_sales' across various stores showcases a commendable average R2 score of 0.852, signifying that the model is capable of explaining a significant portion of the variance in the sales data. With an average RMSE of 253.41, the predictions are in a reasonable range from the actual sales figures, although the high MSE values suggest that there's notable variability in the predictions, possibly due to outliers or unaccounted factors in the sales trends.

In [ ]:
forecast_model('daily_yoy_ndt.total_food_service')
   Store           MSE        RMSE        R2
0  22855   3747.248983   61.214777  0.793677
1  22715  20210.060599  142.162093  0.685973
2  22120   4937.896722   70.270170  0.807351
3  23730   2914.887061   53.989694  0.918709
4  24255  10655.491057  103.225438  0.737013
 
Average MSE:8493.116884452824, Average RMSE:86.17243434970929, Average R2: 0.7885444390287065

The performance metrics from the forecasting model for 'daily_yoy_ndt.total_food_service' suggest a model with reasonable predictive accuracy, as evidenced by an average R2 score of 0.811, indicating that the model accounts for a substantial amount of variability in the food service sales data. The average RMSE of 78.69 points to there is possibility of the model overfitting.

In [ ]:
forecast_model('diesel_x')
   Store           MSE         RMSE        R2
0  22855  1.792002e+04   133.865667  0.610316
1  22715  5.897709e+06  2428.519966 -1.674923
2  22120  9.191024e+03    95.869828  0.540061
3  23730  3.412089e+04   184.718407  0.920933
4  24255  1.718907e+05   414.597082  0.824377
 
Average MSE:1226166.3789736975, Average RMSE:651.5141900365818, Average R2: 0.24415294415238337

The forecasting model for 'diesel_x' displays a moderate average R2 score of 0.517, indicating that the model's ability to predict diesel sales is relatively weaker compared to the previous two targets. The average RMSE is relatively high at 540.92, suggesting that the model's sales predictions deviate substantially from the actual figures for some stores. This is corroborated by the wide variance in performance metrics across different stores, with Store 23730 showing exemplary predictive accuracy as indicated by a high R2 score of 0.916 and the lowest RMSE, whereas Store 22715 exhibits a significantly poorer fit, reflected in a low R2 score and a very high RMSE. The vast difference in MSE among the stores indicates that the model struggles with consistency across different locations, likely due to location-specific factors that affect diesel sales and are not captured by the model.

In [ ]:
forecast_model('unleaded')
   Store            MSE        RMSE        R2
0  22855   45369.644576  213.001513  0.834743
1  22715  108351.982438  329.168623  0.556015
2  22120  708046.756729  841.455142 -0.975716
3  23730   25596.662752  159.989571  0.796447
4  24255  108734.336436  329.748899  0.676561
 
Average MSE:199219.87658610358, Average RMSE:374.67274969008656, Average R2: 0.3776100169391928

The forecasting model's performance on 'unleaded' fuel sales reveals a mixed picture with an average R2 score of 0.730, suggesting a fairly good fit where the model is able to explain a majority of the variance in sales. However, the average RMSE of 258.53 points to considerable prediction errors across the dataset. The MSE and RMSE are notably high for Store 22715, indicating substantial deviations from actual sales and the lowest R2 score, which suggests that the model's predictions are less reliable for this store.

2.25 Cross-Validation

Cross-validation is crucial for assessing the performance of machine learning models like XGBoost, particularly in time series forecasting tasks. The below logic implements time series cross-validation, which is essential for evaluating the model's robustness and generalization ability. In time series data, observations are often temporally correlated, and using conventional cross-validation methods may lead to data leakage and overestimation of the model's performance.

The forecast_model_cv function addresses this issue by employing the TimeSeriesSplit strategy, which ensures that each training set is a proper subset of the corresponding validation set in terms of time. This prevents the model from seeing future information during training, providing a more realistic evaluation. The calculated metrics, including Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared (R2), offer insights into how well the XGBoost model performs across different time periods and individual stores, aiding in the identification of potential overfitting or underfitting issues and guiding hyperparameter tuning for optimal forecasting results.

Here we are performing the 5-fold Cross-Validation

In [ ]:
def prepare_data(data, indices, target_variable):
    # Filter the data for the specified indices
    data_filtered = data.iloc[indices]

    data_first_year = data_filtered[data_filtered['calendar.calendar_day_date'] < data_filtered['capital_projects.soft_opening_date'] + pd.DateOffset(years=1)]


    # Prepare features and target
    X = data_first_year.drop([
        'daily_yoy_ndt.total_inside_sales',
        'daily_yoy_ndt.total_food_service',
        'diesel_x',
        'unleaded',
        'site_id_msba',
        'calendar.calendar_day_date',
        'capital_projects.soft_opening_date'
    ], axis=1)
    y = data_first_year[target_variable]

    # One-hot encode the categorical variables
    X = pd.get_dummies(X, drop_first=True)

    return X, y

def forecast_model_cv(target_variable, n_splits=5):

    # Create time series cross-validator object
    train_data, test_data = create_lag_features(target_variable)

    tscv = TimeSeriesSplit(n_splits=n_splits)

    # Initialize the XGBRegressor
    if target_variable in ['diesel_x']:
      model = XGBRegressor(n_estimators=1000, learning_rate=0.1, random_state=42, max_depth=3, subsample=1,  colsample_bytree=1)
    else:
      model = XGBRegressor(n_estimators=500, learning_rate=0.05, random_state=42)

    # Dictionary to store the evaluation metrics for each fold
    fold_metrics = {}

    # Perform time series cross-validation
    for fold, (train_index, test_index) in enumerate(tscv.split(train_data)):
        # Prepare the training and testing sets
        X_train, y_train = prepare_data(train_data, train_index, target_variable)
        X_test, y_test = prepare_data(train_data, test_index, target_variable)

        # Train the model on the current training set
        model.fit(X_train, y_train)

        # Predict on the current testing set
        y_pred = model.predict(X_test)

        # Evaluate the model
        mse = mean_squared_error(y_test, y_pred)
        rmse = mean_squared_error(y_test, y_pred, squared=False)
        r2 = r2_score(y_test, y_pred)

        # Store the metrics for the current fold
        fold_metrics[fold] = {'MSE': mse, 'RMSE': rmse, 'R2': r2}

    # Print the metrics for each fold
    for fold in fold_metrics:
        print(f"Fold {fold}: MSE = {fold_metrics[fold]['MSE']}, RMSE = {fold_metrics[fold]['RMSE']}, R2 = {fold_metrics[fold]['R2']}")

    # Calculate the average performance across all folds
    avg_mse = sum(fm['MSE'] for fm in fold_metrics.values()) / n_splits
    avg_rmse = sum(fm['RMSE'] for fm in fold_metrics.values()) / n_splits
    avg_r2 = sum(fm['R2'] for fm in fold_metrics.values()) / n_splits

    print(f"Average MSE: {avg_mse}, Average RMSE: {avg_rmse}, Average R2: {avg_r2}")

    store_metrics = {}

    # Predict and evaluate for each of the prediction stores
    for store_id in test_stores:
        X_predict, y_true, dates = prepare_predict_data(test_data, store_id, target_variable)
        y_pred = model.predict(X_predict)
        mse = mean_squared_error(y_true, y_pred)
        rmse = mean_squared_error(y_true, y_pred, squared=False)
        r2 = r2_score(y_true, y_pred)
        store_metrics[store_id] = {'MSE': mse, 'RMSE': rmse, 'R2':r2}

    store_metrics_df = pd.DataFrame(store_metrics).T  # Transpose to have stores as rows
    store_metrics_df.reset_index(inplace=True)
    store_metrics_df.rename(columns={'index': 'Store'}, inplace=True)

    r2_sum = 0
    mse_sum = 0
    rmse_sum = 0
    for k, v in store_metrics.items():
      r2_sum+=v['R2']
      mse_sum+=v['MSE']
      rmse_sum+=v['RMSE']

    avg_r2 = r2_sum/len(store_metrics)
    avg_mse = mse_sum/len(store_metrics)
    avg_rmse = rmse_sum/len(store_metrics)
    print("--------Validation set results---------")
    print(f'Average MSE:{avg_mse}, Average RMSE:{avg_rmse}, Average R2: {avg_r2}')
In [ ]:
forecast_model_cv('daily_yoy_ndt.total_inside_sales')
Fold 0: MSE = 133129.85295525612, RMSE = 364.86963830285487, R2 = 0.6779509115851128
Fold 1: MSE = 881600.4979655384, RMSE = 938.9358327199673, R2 = 0.08185414340202524
Fold 2: MSE = 142604.70716958528, RMSE = 377.630384330479, R2 = 0.8184475550749276
Fold 3: MSE = 82640.91154668745, RMSE = 287.4733231913658, R2 = 0.9177036043351303
Fold 4: MSE = 61130.80111130455, RMSE = 247.24643801540307, R2 = 0.8684970950457372
Average MSE: 260221.35414967436, Average RMSE: 443.231123312014, Average R2: 0.6728906618885866
--------Validation set results---------
Average MSE:64620.72830267446, Average RMSE:248.6110640638843, Average R2: 0.8573949742972411

The above results & logic represents a time series cross-validation process for the XGBoost model applied to the target variable total_inside_sales.' The results for each fold of the cross-validation are displayed, showing the Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared (R2) metrics.

The average performance across all folds is also calculated. The average MSE is 260,221.35, the average RMSE is 443.23, and the average R2 is 0.67, indicating the model's performance across different time periods.

The 'Validation set results' section presents the average metrics for individual stores, emphasizing the model's ability to generalize to new data. In this case, the average MSE is 64,620.73, the average RMSE is 248.61, and the average R2 is 0.86. These metrics provide insights into the XGBoost model's effectiveness in forecasting the 'total_inside_sales' variable, with lower MSE and RMSE values and a higher R2 indicating better predictive performance.

In [ ]:
forecast_model_cv('daily_yoy_ndt.total_food_service')
Fold 0: MSE = 9017.013041198976, RMSE = 94.95795407020402, R2 = 0.8303379614286998
Fold 1: MSE = 92707.64464306035, RMSE = 304.4793008450006, R2 = -0.022260001848123823
Fold 2: MSE = 10465.695345224807, RMSE = 102.30198114027317, R2 = 0.8977615373006135
Fold 3: MSE = 7223.355237594567, RMSE = 84.99032437633456, R2 = 0.938783156739038
Fold 4: MSE = 6482.7335148909, RMSE = 80.51542408067475, R2 = 0.8580003753970591
Average MSE: 25179.288356393925, Average RMSE: 133.4489969024974, Average R2: 0.7005246058034572
--------Validation set results---------
Average MSE:6054.571539000327, Average RMSE:76.0116459787697, Average R2: 0.8240671225169148

The above results logic represents a time series cross-validation process for the XGBoost model applied to the target variable total_food_service sales.' The results for each fold of the cross-validation are displayed, showing the Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared (R2) metrics.

The average performance across all folds is also calculated. The average MSE is 25,179.29, the average RMSE is 133.45, and the average R2 is 0.70, indicating the model's performance across different time periods.

The 'Validation set results' section presents the average metrics for individual stores, emphasizing the model's ability to generalize to new data. In this case, the average MSE is 6,054.57, the average RMSE is 76.01, and the average R2 is 0.82. These metrics provide insights into the XGBoost model's effectiveness in forecasting the 'total_food_service' variable, with lower MSE and RMSE values and a higher R2 indicating better predictive performance.

In [ ]:
forecast_model_cv('diesel_x')
Fold 0: MSE = 99642.33831435276, RMSE = 315.66174667569834, R2 = 0.8882343908511142
Fold 1: MSE = 388133.99555090413, RMSE = 623.004009257488, R2 = 0.8560706812303035
Fold 2: MSE = 115119.18364533328, RMSE = 339.2921803480494, R2 = 0.9030030919437348
Fold 3: MSE = 48642.343487312006, RMSE = 220.55009292066055, R2 = 0.9704720673895306
Fold 4: MSE = 77605.24925618058, RMSE = 278.5771872501059, R2 = 0.9241587913070219
Average MSE: 145828.62205081654, Average RMSE: 355.41704329040044, Average R2: 0.908387804544341
--------Validation set results---------
Average MSE:1143801.9627692136, Average RMSE:638.3643855967982, Average R2: 0.2611097952178426

The provided logic demonstrates a time series cross-validation process for the XGBoost model applied to the target variable 'diesel_x.' The results for each fold of the cross-validation include the Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared (R2) metrics.

The average performance across all folds is also computed. The average MSE is 145,828.62, the average RMSE is 355.42, and the average R2 is 0.91, indicating the model's overall effectiveness in forecasting 'diesel_x' across different time periods.

However, the 'Validation set results' section suggests that the model's performance varies across individual stores, as indicated by the considerably higher average MSE of 1,143,801.96, average RMSE of 638.36, and average R2 of 0.26 for the specific stores in the validation set. These metrics provide insights into the XGBoost model's effectiveness in forecasting the 'Diesel' variable, with the lower average MSE and RMSE values and higher R2 indicating better performance on the training data compared to the validation set.

In [ ]:
forecast_model_cv('unleaded')
Fold 0: MSE = 164971.78122909344, RMSE = 406.1671838407104, R2 = 0.8312750557862972
Fold 1: MSE = 61658.09857672998, RMSE = 248.31048825357735, R2 = 0.8956959074832139
Fold 2: MSE = 55896.10348556213, RMSE = 236.42356795709293, R2 = 0.9098274144967747
Fold 3: MSE = 131024.5457831534, RMSE = 361.9731285374004, R2 = 0.8703628951609286
Fold 4: MSE = 60384.74899318524, RMSE = 245.73308485669006, R2 = 0.9377089024060206
Average MSE: 94787.05561354483, Average RMSE: 299.72149068909425, Average R2: 0.888974035066647
--------Validation set results---------
Average MSE:123005.20152638089, Average RMSE:321.78147755855224, Average R2: 0.5954615626634397

The above results & logic demonstrates a time series cross-validation process for the XGBoost model applied to the target variable 'unleaded.' The results for each fold of the cross-validation include the Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared (R2) metrics.

The average performance across all folds is also computed. The average MSE is 94,787.06, the average RMSE is 299.72, and the average R2 is 0.89, indicating the model's overall effectiveness in forecasting 'unleaded' across different time periods.

However, the 'Validation set results' section suggests that the model's performance varies across individual stores, as indicated by the considerably higher average MSE of 123,005.20, average RMSE of 321.78, and average R2 of 0.60 for the specific stores in the validation set. These metrics provide insights into the XGBoost model's effectiveness in forecasting the 'Unleaded' variable, with the lower average MSE and RMSE values and higher R2 indicating better performance on the training data compared to the validation set.

3.The Other Models

3.1. Prophet Model

In [ ]:
from sklearn.model_selection import train_test_split

# Define the ratio for the test data
test_ratio = 0.2  # You can adjust this ratio as needed

# Split the data into training and test sets
train_data, test_data = train_test_split(df_tsdm, test_size=test_ratio, random_state=42)

# Save the training and test data as CSV files if needed
train_data.to_csv('training_data.csv', index=False)
test_data.to_csv('test_data.csv', index=False)

3.2 Total Inside Sales

In [ ]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from prophet import Prophet

# Assuming you have a DataFrame called train_data
inside_sales_train = train_data[['calendar.calendar_day_date', 'daily_yoy_ndt.total_inside_sales']]
inside_sales_train.rename(columns={'calendar.calendar_day_date': 'ds', 'daily_yoy_ndt.total_inside_sales': 'y'}, inplace=True)


# Define and fit the Prophet model for inside sales
inside_sales_model = Prophet()
inside_sales_model.fit(inside_sales_train)

# Generate future date range for inside sales forecasting (365 days)
future_inside_sales = inside_sales_model.make_future_dataframe(periods=365)

# Make predictions for inside sales
forecast_inside_sales = inside_sales_model.predict(future_inside_sales)

# Access forecasted values for inside sales
forecasted_values_inside_sales = forecast_inside_sales[['ds', 'yhat']]

# Visualize the forecast and components for inside sales
fig_inside_sales = inside_sales_model.plot(forecast_inside_sales)
fig_inside_sales_components = inside_sales_model.plot_components(forecast_inside_sales)

# Set titles for the plots
fig_inside_sales.suptitle("Inside Sales Actual + Forecast for next 12 months")
fig_inside_sales_components.suptitle("Inside Sales Forecast Components")

# Assuming you have your actual values (ground truth) for the validation period
actual_values = test_data['daily_yoy_ndt.total_inside_sales'].tail(365).values

# Extract forecasted values for the validation period
forecasted_values = forecasted_values_inside_sales.tail(365)['yhat'].values

# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(actual_values, forecasted_values)

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(actual_values, forecasted_values)

# Calculate Root Mean Squared Error (RMSE)
rmse = np.sqrt(mse)

# Calculate R-squared (R2)
r2 = r2_score(actual_values, forecasted_values)

print(f"-----------------------Features---------------------------")

print(f'Mean Absolute Error- MAE: {mae}')
print(f'Mean Squared Error- MSE: {mse}')
print(f'Root Mean Square Error- RMSE: {rmse}')
print(f'R2:R-squared{r2}')
<ipython-input-102-9738787503ad>:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  inside_sales_train.rename(columns={'calendar.calendar_day_date': 'ds', 'daily_yoy_ndt.total_inside_sales': 'y'}, inplace=True)
DEBUG:cmdstanpy:input tempfile: /tmp/tmpcw905kue/kr0p04wz.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpcw905kue/g6x0y1g4.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=38787', 'data', 'file=/tmp/tmpcw905kue/kr0p04wz.json', 'init=/tmp/tmpcw905kue/g6x0y1g4.json', 'output', 'file=/tmp/tmpcw905kue/prophet_model7bh8affo/prophet_model-20231111232135.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
23:21:35 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
23:21:36 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
-----------------------Features---------------------------
Mean Absolute Error- MAE: 909.1675562898364
Mean Squared Error- MSE: 1277009.5908604828
Root Mean Square Error- RMSE: 1130.0484904907767
R2:R-squared-0.2975971403214599

The Time series forecast indicates the Inside sales for the next 1 year¶

The three plots obtained are components of inside sales forecasting using the Prophet model. These plots help visualize the model's predictions and identify patterns and trends in the data.¶

Plot 1: Forecasted Values and Uncertainty Intervals¶

This plot shows the forecasted values for inside sales over the next 365 days, along with uncertainty intervals. The forecasted values represent the model's predictions for the 'daily_yoy_ndt.total_inside_sales' variable. The blue line represents the overall trend, and the shaded gray area represents the uncertainty intervals. The uncertainty intervals indicate the range within which the model is confident the actual values will fall.¶

Plot 2: Trend Component¶

This plot shows the trend component of the inside sales data. The trend component captures the overall long-term direction of the time series. It helps understand the underlying growth or decline in sales over time.¶

Plot 3: Seasonality Component¶

This plot shows the seasonality component of the inside sales data. The seasonality component captures the periodic fluctuations in sales that occur within a year. It helps identify seasonal patterns, such as increased sales during certain months or holidays.¶

Highest Inside sales are on Friday , with lowest sales on Sunday¶

Highest Inside sales are in August to September months, with lowest sales near January¶

By analyzing these plots, we can gain insights into the factors that influence inside sales and make informed decisions about sales strategies and resource allocation.¶

3.3 Total Food Service Sales

In [ ]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from prophet import Prophet

# Assuming you have a DataFrame called train_data
food_service_train = train_data[['calendar.calendar_day_date', 'daily_yoy_ndt.total_food_service']]
food_service_train.rename(columns={'calendar.calendar_day_date': 'ds', 'daily_yoy_ndt.total_food_service': 'y'}, inplace=True)

# Define and fit the Prophet model for food service sales
food_service_model = Prophet()
food_service_model.fit(food_service_train)

# Generate future date range for food service sales forecasting (365 days)
future_food_service = food_service_model.make_future_dataframe(periods=365)

# Make predictions for food service sales
forecast_food_service = food_service_model.predict(future_food_service)

# Access forecasted values for food service sales
forecasted_values_food_service = forecast_food_service[['ds', 'yhat']]

# Visualize the forecast and components for food service sales
fig_food_service = food_service_model.plot(forecast_food_service)
fig_food_service_components = food_service_model.plot_components(forecast_food_service)

# Set titles for the plots
fig_food_service.suptitle("Food Service Sales Actual + Forecast for next 12 months")
fig_food_service_components.suptitle("Food Service Sales Forecast Components")

# Assuming you have your actual values (ground truth) for the validation period
actual_values = test_data['daily_yoy_ndt.total_food_service'].tail(365).values

# Extract forecasted values for the validation period
forecasted_values = forecasted_values_food_service.tail(365)['yhat'].values

# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(actual_values, forecasted_values)

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(actual_values, forecasted_values)

# Calculate Root Mean Squared Error (RMSE)
rmse = np.sqrt(mse)

# Calculate R-squared (R2)
r2 = r2_score(actual_values, forecasted_values)

print(f"-----------------------Features---------------------------")

print(f'Mean Absolute Error- MAE: {mae}')
print(f'Mean Squared Error- MSE: {mse}')
print(f'Root Mean Square Error- RMSE: {rmse}')
print(f'R2:R-squared{r2}')
<ipython-input-103-c4403c841e4b>:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  food_service_train.rename(columns={'calendar.calendar_day_date': 'ds', 'daily_yoy_ndt.total_food_service': 'y'}, inplace=True)
DEBUG:cmdstanpy:input tempfile: /tmp/tmpcw905kue/2s55jps9.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpcw905kue/gxoly90h.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=94984', 'data', 'file=/tmp/tmpcw905kue/2s55jps9.json', 'init=/tmp/tmpcw905kue/gxoly90h.json', 'output', 'file=/tmp/tmpcw905kue/prophet_modelr9sm3ily/prophet_model-20231111232141.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
23:21:41 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
23:21:43 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
-----------------------Features---------------------------
Mean Absolute Error- MAE: 325.3608875609529
Mean Squared Error- MSE: 161153.13459088575
Root Mean Square Error- RMSE: 401.43883044728716
R2:R-squared-0.49217386879986

The Time series forecast indicates the Total Food sales for the next 1 year¶

The three plots obtained are components of Food sales forecasting using the Prophet model. These plots help visualize the model's predictions and identify patterns and trends in the data.¶

Plot 1: Forecasted Values and Uncertainty Intervals¶

This plot shows the forecasted values for Food sales over the next 365 days, along with uncertainty intervals. The forecasted values represent the model's predictions for the 'daily_yoy_ndt.total_food_service' variable. The blue line represents the overall trend, and the shaded blue area represents the uncertainty intervals. The uncertainty intervals indicate the range within which the model is confident the actual values will fall.¶

Plot 2: Trend Component¶

This plot shows the trend component of the Food sales data. The trend component captures the overall long-term direction of the time series. It helps understand the underlying growth or decline in sales over time.¶

Plot 3: Seasonality Component¶

This plot shows the seasonality component of the Food sales data. The seasonality component captures the periodic fluctuations in sales that occur within a year. It helps identify seasonal patterns, such as increased sales during certain months or holidays.¶

Highest Food sales are on Friday , with lowest sales on Sunday¶

Highest Food sales are in October month, with lowest sales near January¶

By analyzing these plots, we can gain insights into the factors that influence food sales and make informed decisions about sales strategies and resource allocation.¶

3.4 Total Diesel Sales

In [ ]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from prophet import Prophet

# Assuming we have a DataFrame called train_data
diesel_train = train_data[['calendar.calendar_day_date', 'diesel']]
diesel_train.rename(columns={'calendar.calendar_day_date': 'ds', 'diesel': 'y'}, inplace=True)

# Define and fit the Prophet model for diesel sales
diesel_model = Prophet()
diesel_model.fit(diesel_train)

# Generate future date range for diesel sales forecasting (365 days)
future_diesel = diesel_model.make_future_dataframe(periods=365)

# Make predictions for diesel sales
forecast_diesel = diesel_model.predict(future_diesel)

# Access forecasted values for diesel sales
forecasted_values_diesel = forecast_diesel[['ds', 'yhat']]

# Visualize the forecast and components for diesel sales
fig_diesel = diesel_model.plot(forecast_diesel)
fig_diesel_components = diesel_model.plot_components(forecast_diesel)

# Set titles for the plots
fig_diesel.suptitle("Diesel Sales Actual + Forecast for next 12 months")
fig_diesel_components.suptitle("Diesel Sales Forecast Components")

# Assuming we have your actual values (ground truth) for the validation period
actual_values = test_data['diesel'].tail(365).values

# Extract forecasted values for the validation period
forecasted_values = forecasted_values_diesel.tail(365)['yhat'].values

# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(actual_values, forecasted_values)

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(actual_values, forecasted_values)

# Calculate Root Mean Squared Error (RMSE)
rmse = np.sqrt(mse)

# Calculate R-squared (R2)
r2 = r2_score(actual_values, forecasted_values)

print(f"-----------------------Features---------------------------")

print(f'Mean Absolute Error- MAE: {mae}')
print(f'Mean Squared Error- MSE: {mse}')
print(f'Root Mean Square Error- RMSE: {rmse}')
print(f'R2:R-squared{r2}')
<ipython-input-104-68463db684b8>:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  diesel_train.rename(columns={'calendar.calendar_day_date': 'ds', 'diesel': 'y'}, inplace=True)
DEBUG:cmdstanpy:input tempfile: /tmp/tmpcw905kue/koysj68s.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpcw905kue/bzwdsq9m.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=34478', 'data', 'file=/tmp/tmpcw905kue/koysj68s.json', 'init=/tmp/tmpcw905kue/bzwdsq9m.json', 'output', 'file=/tmp/tmpcw905kue/prophet_model9n_81_55/prophet_model-20231111232146.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
23:21:46 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
23:21:46 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
-----------------------Features---------------------------
Mean Absolute Error- MAE: 1268.3773544811488
Mean Squared Error- MSE: 3139399.9493940254
Root Mean Square Error- RMSE: 1771.835192503531
R2:R-squared-0.06722174488757537

The Time series forecast indicates the Diesel sales for the next 1 year¶

The three plots obtained are components of Diesel sales forecasting using the Prophet model. These plots help visualize the model's predictions and identify patterns and trends in the data.¶

Plot 1: Forecasted Values and Uncertainty Intervals¶

This plot shows the forecasted values for Diesel sales over the next 365 days, along with uncertainty intervals. The forecasted values represent the model's predictions for the 'diesel' variable. The blue line represents the overall trend, and the shaded blue area represents the uncertainty intervals. The uncertainty intervals indicate the range within which the model is confident the actual values will fall.¶

Plot 2: Trend Component¶

This plot shows the trend component of the Diesel sales data. The trend component captures the overall long-term direction of the time series. It helps understand the underlying growth or decline in sales over time.¶

Plot 3: Seasonality Component¶

This plot shows the seasonality component of the Diesel sales data. The seasonality component captures the periodic fluctuations in sales that occur within a year. It helps identify seasonal patterns, such as increased sales during certain months or holidays.¶

Highest Diesel sales are on Wednesday , with lowest sales on Sunday¶

Highest Diesel sales are in July month, with lowest sales near January¶

By analyzing these plots, we can gain insights into the factors that influence Diesel sales and make informed decisions about sales strategies and resource allocation.¶

3.5 Total Unleaded Sales

In [ ]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from prophet import Prophet

# Assuming you have a DataFrame called train_data
unleaded_train = train_data[['calendar.calendar_day_date', 'unleaded']]
unleaded_train.rename(columns={'calendar.calendar_day_date': 'ds', 'unleaded': 'y'}, inplace=True)

# Define and fit the Prophet model for unleaded sales
unleaded_model = Prophet()
unleaded_model.fit(unleaded_train)

# Generate future date range for unleaded sales forecasting (365 days)
future_unleaded = unleaded_model.make_future_dataframe(periods=365)

# Make predictions for unleaded sales
forecast_unleaded = unleaded_model.predict(future_unleaded)

# Access forecasted values for unleaded sales
forecasted_values_unleaded = forecast_unleaded[['ds', 'yhat']]

# Visualize the forecast and components for unleaded sales
fig_unleaded = unleaded_model.plot(forecast_unleaded)
fig_unleaded_components = unleaded_model.plot_components(forecast_unleaded)

# Set titles for the plots
fig_unleaded.suptitle("Unleaded Sales Actual + Forecast for next 12 months")
fig_unleaded_components.suptitle("Unleaded Sales Forecast Components")

# Assuming you have your actual values (ground truth) for the validation period
actual_values = test_data['unleaded'].tail(365).values

# Extract forecasted values for the validation period
forecasted_values = forecasted_values_unleaded.tail(365)['yhat'].values

# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(actual_values, forecasted_values)

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(actual_values, forecasted_values)

# Calculate Root Mean Squared Error (RMSE)
rmse = np.sqrt(mse)

# Calculate R-squared (R2)
r2 = r2_score(actual_values, forecasted_values)

print(f"-----------------------Features---------------------------")

print(f'Mean Absolute Error- MAE: {mae}')
print(f'Mean Squared Error- MSE: {mse}')
print(f'Root Mean Square Error- RMSE: {rmse}')
print(f'R2:R-squared{r2}')
<ipython-input-105-60cd7a72131b>:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  unleaded_train.rename(columns={'calendar.calendar_day_date': 'ds', 'unleaded': 'y'}, inplace=True)
DEBUG:cmdstanpy:input tempfile: /tmp/tmpcw905kue/ewo1qi1w.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpcw905kue/d05oa08l.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=5696', 'data', 'file=/tmp/tmpcw905kue/ewo1qi1w.json', 'init=/tmp/tmpcw905kue/d05oa08l.json', 'output', 'file=/tmp/tmpcw905kue/prophet_modelmx3v81kx/prophet_model-20231111232149.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
23:21:49 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
23:21:50 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
-----------------------Features---------------------------
Mean Absolute Error- MAE: 793.8829961059837
Mean Squared Error- MSE: 1090062.143175371
Root Mean Square Error- RMSE: 1044.060411650289
R2:R-squared-0.0944194610733573

The Time series forecast indicates the Unleaded sales for the next 1 year¶

The three plots obtained are components of Unleaded sales forecasting using the Prophet model. These plots help visualize the model's predictions and identify patterns and trends in the data.¶

Plot 1: Forecasted Values and Uncertainty Intervals¶

This plot shows the forecasted values for Unleaded sales over the next 365 days, along with uncertainty intervals. The forecasted values represent the model's predictions for the 'Unleaded' variable. The blue line represents the overall trend, and the shaded blue area represents the uncertainty intervals. The uncertainty intervals indicate the range within which the model is confident the actual values will fall.¶

Plot 2: Trend Component¶

This plot shows the trend component of the Unleaded sales data. The trend component captures the overall long-term direction of the time series. It helps understand the underlying growth or decline in sales over time.¶

Plot 3: Seasonality Component¶

This plot shows the seasonality component of the Unleaded sales data. The seasonality component captures the periodic fluctuations in sales that occur within a year. It helps identify seasonal patterns, such as increased sales during certain months or holidays.¶

Highest Unleaded sales are on Friday , with lowest sales on Sunday¶

Highest Unleaded sales are in the month of August, with lowest sales near January¶

By analyzing these plots, we can gain insights into the factors that influence Unleaded sales and make informed decisions about sales strategies and resource allocation.¶

In [ ]:
from tabulate import tabulate

data = [
    ['daily_yoy_ndt.total_inside_sales', '911.23','1280792.08', '1131.72'],
    ['daily_yoy_ndt.total_food_service', '325.36', '161153.13', '401.43'],
    ['Diesel', '1268.37', '3139399.94', '1771.83',],
    ['Unleaded', '793.88', '1090062.14', '1044.06']
]

title = "Prophet Best Model Results"


print("\n" + title)
print(tabulate(data, headers=['Output Variables','MAE', 'MSE', 'RMSE'], tablefmt='fancy_grid'))
Prophet Best Model Results
╒══════════════════════════════════╤═════════╤══════════════════╤═════════╕
│ Output Variables                 │     MAE │              MSE │    RMSE │
╞══════════════════════════════════╪═════════╪══════════════════╪═════════╡
│ daily_yoy_ndt.total_inside_sales │  911.23 │      1.28079e+06 │ 1131.72 │
├──────────────────────────────────┼─────────┼──────────────────┼─────────┤
│ daily_yoy_ndt.total_food_service │  325.36 │ 161153           │  401.43 │
├──────────────────────────────────┼─────────┼──────────────────┼─────────┤
│ Diesel                           │ 1268.37 │      3.1394e+06  │ 1771.83 │
├──────────────────────────────────┼─────────┼──────────────────┼─────────┤
│ Unleaded                         │  793.88 │      1.09006e+06 │ 1044.06 │
╘══════════════════════════════════╧═════════╧══════════════════╧═════════╛

4.0 Linear Regression

One of the models we tried was Linear Regresion.This model was used because it can be used to forecast the dependent variable's future values based on the independent variable's values. Also, it can be really useful when trying to predict sales. The initial Linear Regression model that was made used all of the independent variables. Then Ridge regression was used to identify features with high VIF values to then remove them from the model. Therefore, the second model created only included the food variables identified from the Ridge Regression. The second model yielded the best results and its results can be seen below.

In [ ]:
from tabulate import tabulate

data = [
    ['daily_yoy_ndt.total_inside_sales', '308684.84', '555.59','0.66'],
    ['daily_yoy_ndt.total_food_service', '55279.02', '235.11', '0.48'],
    ['Diesel', '19636308.01', '4431.29', '-0.34'],
    ['Unleaded', '761533.91', '872.66', '-0.04']
]

title = "Linear Regression Best Model Results"


print("\n" + title)
print(tabulate(data, headers=['Output Variables', 'MSE', 'RMSE', 'R-Squared'], tablefmt='fancy_grid'))
Linear Regression Best Model Results
╒══════════════════════════════════╤══════════════════╤═════════╤═════════════╕
│ Output Variables                 │              MSE │    RMSE │   R-Squared │
╞══════════════════════════════════╪══════════════════╪═════════╪═════════════╡
│ daily_yoy_ndt.total_inside_sales │ 308685           │  555.59 │        0.66 │
├──────────────────────────────────┼──────────────────┼─────────┼─────────────┤
│ daily_yoy_ndt.total_food_service │  55279           │  235.11 │        0.48 │
├──────────────────────────────────┼──────────────────┼─────────┼─────────────┤
│ Diesel                           │      1.96363e+07 │ 4431.29 │       -0.34 │
├──────────────────────────────────┼──────────────────┼─────────┼─────────────┤
│ Unleaded                         │ 761534           │  872.66 │       -0.04 │
╘══════════════════════════════════╧══════════════════╧═════════╧═════════════╛

4.1 Random Forest

One of the models we tried was Random Forest. Random Forest is a good model to try because it is an ensemble of decision trees which leads to it having lower variance than other machine learning algorithms and it can produce better results. The model can also improve time series predictions by resolving biases in the model that might exist. The initial Random Forest model created was a regular Random Forest model and had n_estimators=100 and random_state=40. The second model created used hyperparameter tuning to try and yield better results. The second model did yield better results and its results can be seen below.

In [ ]:
from tabulate import tabulate

data = [
    ['daily_yoy_ndt.total_inside_sales', '621285.06', '788.22','0.26'],
    ['daily_yoy_ndt.total_food_service', '47230.02', '217.32', '0.57'],
    ['Diesel', '1505394.43', '1226.95', '0.33'],
    ['Unleaded', '3582945.21', '1892.87', '-0.78']
]

title = "Random Forest Best Model Results"


print("\n" + title)
print(tabulate(data, headers=['Output Variables', 'MSE', 'RMSE', 'R-Squared'], tablefmt='fancy_grid'))
Random Forest Best Model Results
╒══════════════════════════════════╤══════════════════╤═════════╤═════════════╕
│ Output Variables                 │              MSE │    RMSE │   R-Squared │
╞══════════════════════════════════╪══════════════════╪═════════╪═════════════╡
│ daily_yoy_ndt.total_inside_sales │ 621285           │  788.22 │        0.26 │
├──────────────────────────────────┼──────────────────┼─────────┼─────────────┤
│ daily_yoy_ndt.total_food_service │  47230           │  217.32 │        0.57 │
├──────────────────────────────────┼──────────────────┼─────────┼─────────────┤
│ Diesel                           │      1.50539e+06 │ 1226.95 │        0.33 │
├──────────────────────────────────┼──────────────────┼─────────┼─────────────┤
│ Unleaded                         │      3.58295e+06 │ 1892.87 │       -0.78 │
╘══════════════════════════════════╧══════════════════╧═════════╧═════════════╛

4.2 SARIMA

SARIMA (Seasonal Autoregressive Integrated Moving Average) models are particularly useful for time series forecasting tasks, especially when the data exhibit a clear seasonality pattern. It also captures both the trend and seasonal patterns in the data and capable of modeling the seasonal component, which is essential for forecasting tasks where the data exhibits regular patterns over time.

A grid search is performed using auto_arima to find the best SARIMA parameters for the training data. The best order and seasonal order obtained from the grid search are used to fit a SARIMA model. Predictions are made on the testing data, RMSE (Root Mean Square Error), MSE (Mean Square Error), and R-Squared, are calculated to evaluate the model's performance for each target variable.

In [ ]:
from tabulate import tabulate

data = [
    ['daily_yoy_ndt.total_inside_sales', '1514448.86', '1230.63'],
    ['daily_yoy_ndt.total_food_service', '180738.84', '425.13'],
    ['Diesel', '16674629.69', '4083.46'],
    ['Unleaded', '863498.96', '929.25']
]

title = "SARIMA Best Model Results"


print("\n" + title)
print(tabulate(data, headers=['Output Variables', 'MSE', 'RMSE', 'R-Squared'], tablefmt='fancy_grid'))
SARIMA Best Model Results
╒══════════════════════════════════╤══════════════════╤═════════╕
│ Output Variables                 │              MSE │    RMSE │
╞══════════════════════════════════╪══════════════════╪═════════╡
│ daily_yoy_ndt.total_inside_sales │      1.51445e+06 │ 1230.63 │
├──────────────────────────────────┼──────────────────┼─────────┤
│ daily_yoy_ndt.total_food_service │ 180739           │  425.13 │
├──────────────────────────────────┼──────────────────┼─────────┤
│ Diesel                           │      1.66746e+07 │ 4083.46 │
├──────────────────────────────────┼──────────────────┼─────────┤
│ Unleaded                         │ 863499           │  929.25 │
╘══════════════════════════════════╧══════════════════╧═════════╛

5. Summary

Across the different forecasting models for daily year-over-year inside sales, food service sales, diesel, and unleaded fuel sales, the models demonstrate varying degrees of predictive accuracy, with R2 scores ranging from moderate to good. The models for inside sales and food service sales show relatively high R2 values, indicating that they are able to explain a substantial portion of the variance in the respective sales figures. The unleaded sales model also reflects a good fit, with its R2 score illustrating that it captures the variance well across the stores. However, the diesel sales model exhibits a lower average R2, suggesting a weaker predictive capability.

The RMSE values across all models show that there is room for improvement in the prediction accuracy, with particular attention needed for specific stores where the predictions were less accurate. Notably, the RMSE for diesel sales is significantly higher than for other categories, indicating larger discrepancies between the predicted and actual sales.

The wide range of MSE and RMSE values across stores for each model indicates inconsistencies in model performance, which could be due to store-specific factors not accounted for in the models.

6. Conclusion

Implementing the XGBoost Regressor to forecast the target features by predicting test stores individually as well as applying cross validations we observe that the Average RMSE and R2 score in both cases are almost similar.

a. daily_yoy_ndt.total_inside_sales:

Average MSE: 260221.35414967436, Average RMSE: 443.231123312014, Average R2: 0.67289066188858668

b. daily_yoy_ndt.total_food_service:

Average MSE: 25179.288356393925, Average RMSE: 133.4489969024974, Average R2: 0.7005246058034572

c. diesel: Average MSE: 145828.62205081654, Average RMSE: 355.41704329040044, Average R2: 0.908387804544341

d. unleaded: Average MSE: 94787.05561354483, Average RMSE: 299.72149068909425, Average R2: 0.888974035066647

In conclusion, while the models are generally effective at forecasting sales across different categories, the results highlight the possibility of overfitting and the need for further refinement. This could include incorporating additional location-specific variables, adjusting model parameters, or exploring different modeling techniques to improve predictive accuracy and consistency across all stores. The disparities in model performance suggest that a one-size-fits-all model may not be optimal and that a more customized approach may be necessary to address the nuances of each store's sales dynamics.

7. Contributions

7.1. Daryle Bilog

I created the SARIMA model and used auto grid search to get the best result on this model.

7.2 Joe Sarnello

I created the Linear Regression and Random Forest models and provided their descriptions and results.

7.3. Sanskriti Bhargava

I performed descriptive and statistical analysis on the time-series and qualitative data. I treated missing values in the categorical data by creating new 'None' category . Moreover, I detected and treated the outliers for each target variable by using the IQR Method. I even performed the stationarity test for each target variable and created Autocorrelation plots for analysis. I performed feature engineering by creating new date features and also created lag and rolling features. Further, I encoded the categorical variables using One-Hot Encoding and split the data into train and test based on store ids. Finally, I trained, tested and performed forecasting using the XGBRegressor.

7.4. Vinay Kumar Vascuri

I created ARIMA, Prophet Models. Presented the results of Prophet Model along with visualizations, provided the write up for Introduction, summary of the visualizations and the logics.